Audio control using auditory event detection

ABSTRACT

In some embodiments, a method for processing an audio signal in an audio processing apparatus is disclosed. The method includes receiving an audio signal and a parameter, the parameter indicating a location of an auditory event boundary. An audio portion between consecutive auditory event boundaries constitutes an auditory event. The method further includes applying a modification to the audio signal based in part on an occurrence of the auditory event. The parameter may be generated by monitoring a characteristic of the audio signal and identifying a change in the characteristic.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/093,178, filed Nov. 9, 2020, which is a continuation of U.S. patent application Ser. No. 16/729,468 filed Dec. 29, 2019, now U.S. Pat. No. 10,833,644, which is a divisional of U.S. patent application Ser. No. 16/365,947 filed on Mar. 27, 2019, now U.S. Pat. No. 10,523,169, which is a continuation of U.S. patent application Ser. No. 16/128,642 filed on Sep. 12, 2018, now U.S. Pat. No. 10,284,159, which is a continuation application of U.S. patent application Ser. No. 15/809,413 filed on Nov. 10, 2017, now U.S. Pat. No. 10,103,700, which is a continuation of U.S. patent application Ser. No. 15/447,564 filed on Mar. 2, 2017, now U.S. Pat. No. 9,866,191, which is a continuation of U.S. patent application Ser. No. 15/238,820 filed on Aug. 17, 2016, now U.S. Pat. No. 9,685,924, which is a continuation of U.S. patent application Ser. No. 13/850,380 filed on Mar. 26, 2013, now U.S. Pat. No. 9,450,551, which is a continuation of U.S. patent application Ser. No. 13/464,102 filed on May 4, 2012, now U.S. Pat. No. 8,428,270, which is a continuation of U.S. patent application Ser. No. 13/406,929 filed on Feb. 28, 2012, now U.S. Pat. No. 9,136,810, which is a continuation of U.S. patent application Ser. No. 12/226,698 filed on Jan. 19, 2009, now U.S. Pat. No. 8,144,881, which is a national application of PCT application No. PCT/US2007/008313 filed Mar. 30, 2007, which claims the benefit of the filing date of U.S. Provisional Patent Application No. 60/795,808 filed on Apr. 27, 2006, all of which are hereby incorporated by reference.

TECHNICAL FIELD

The present invention relates to audio dynamic range control methods and apparatus in which an audio processing device analyzes an audio signal and changes the level, gain or dynamic range of the audio as a function of auditory events. The invention also relates to computer programs for practicing such methods or controlling such apparatus.

BACKGROUND ART

Dynamics Processing of Audio

The techniques of automatic gain control (AGC) and dynamic range control (DRC) are well known and are a common element of many audio signal paths. In an abstract sense, both techniques measure the level of an audio signal in some manner and then gain-modify the signal by an amount that is a function of the measured level. In a linear, 1:1 dynamics processing system, the input audio is not processed and the output audio signal ideally matches the input audio signal. Additionally, in an audio dynamics processing system that automatically measures characteristics of the input signal and uses that measurement to control the output signal, if the input signal rises in level by 6 dB and the output signal is processed such that it rises in level by only 3 dB, then the output signal has been compressed by a ratio of 2:1 with respect to the input signal. International Publication Number WO 2006/047600 A1 (“Calculating and Adjusting the Perceived Loudness and/or the Perceived Spectral Balance of an Audio Signal” by Alan Jeffrey Seefeldt) provides a detailed overview of the five basic types of dynamics processing of audio: compression, limiting, automatic gain control (AGC), expansion and gating.
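The 2:1 example above can be checked with a short calculation. The following is a minimal sketch of a static compressor curve, not a description of any particular product; the function name, the threshold parameter and the example levels are assumptions introduced only for illustration.

```python
def compressor_output_db(input_db, threshold_db, ratio):
    """Static compressor curve (illustrative assumption, not from the source).

    Below the threshold the signal passes unchanged (1:1). Above it, every
    `ratio` dB of input rise produces 1 dB of output rise.
    """
    if input_db <= threshold_db:
        return input_db
    return threshold_db + (input_db - threshold_db) / ratio


# Hypothetical levels: the input rises by 6 dB while above the threshold.
before = compressor_output_db(-20.0, threshold_db=-30.0, ratio=2.0)
after = compressor_output_db(-14.0, threshold_db=-30.0, ratio=2.0)
rise_db = after - before  # 3.0 dB of output rise for 6 dB of input rise
```

With a ratio of 1.0 the same function reduces to the linear, 1:1 case in which the output level matches the input level.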

Auditory Events and Auditory Event Detection

The division of sounds into units or segments perceived as separate and distinct is sometimes referred to as “auditory event analysis” or “auditory scene analysis” (“ASA”) and the segments are sometimes referred to as “auditory events” or “audio events.” An extensive discussion of auditory scene analysis is set forth by Albert S. Bregman in his book Auditory Scene Analysis: The Perceptual Organization of Sound (Massachusetts Institute of Technology, 1991, Fourth printing, 2001, Second MIT Press paperback edition). In addition, U.S. Pat. No. 6,002,776 to Bhadkamkar et al., Dec. 14, 1999, cites publications dating back to 1976 as “prior art work related to sound separation by auditory scene analysis.” However, the Bhadkamkar et al. patent discourages the practical use of auditory scene analysis, concluding that “[t]echniques involving auditory scene analysis, although interesting from a scientific point of view as models of human auditory processing, are currently far too computationally demanding and specialized to be considered practical techniques for sound separation until fundamental progress is made.”

A useful way to identify auditory events is set forth by Crockett and Crockett et al. in various patent applications and papers listed below under the heading “Incorporation by Reference.” According to those documents, an audio signal is divided into auditory events, each of which tends to be perceived as separate and distinct, by detecting changes in spectral composition (amplitude as a function of frequency) with respect to time. This may be done, for example, by calculating the spectral content of successive time blocks of the audio signal, calculating the difference in spectral content between successive time blocks of the audio signal, and identifying an auditory event boundary as the boundary between successive time blocks when the difference in the spectral content between such successive time blocks exceeds a threshold. Alternatively, changes in amplitude with respect to time may be calculated instead of or in addition to changes in spectral composition with respect to time.

In its least computationally demanding implementation, the process divides audio into time segments by analyzing the entire frequency band (full bandwidth audio) or substantially the entire frequency band (in practical implementations, band limiting filtering at the ends of the spectrum is often employed) and giving the greatest weight to the loudest audio signal components. This approach takes advantage of a psychoacoustic phenomenon in which at smaller time scales (20 milliseconds (ms) and less) the ear may tend to focus on a single auditory event at a given time. This implies that while multiple events may be occurring at the same time, one component tends to be perceptually most prominent and may be processed individually as though it were the only event taking place. Taking advantage of this effect also allows the auditory event detection to scale with the complexity of the audio being processed. For example, if the input audio signal being processed is a solo instrument, the audio events that are identified will likely be the individual notes being played. Similarly for an input voice signal, the individual components of speech, the vowels and consonants for example, will likely be identified as individual audio elements. As the complexity of the audio increases, such as music with a drumbeat or multiple instruments and voice, the auditory event detection identifies the “most prominent” (i.e., the loudest) audio element at any given moment.

At the expense of greater computational complexity, the process may also take into consideration changes in spectral composition with respect to time in discrete frequency subbands (fixed or dynamically determined or both fixed and dynamically determined subbands) rather than the full bandwidth. This alternative approach takes into account more than one audio stream in different frequency subbands rather than assuming that only a single stream is perceptible at a particular time.

Auditory event detection may be implemented by dividing a time domain audio waveform into time intervals or blocks and then converting the data in each block to the frequency domain, using either a filter bank or a time-frequency transformation, such as the FFT. The amplitude of the spectral content of each block may be normalized in order to eliminate or reduce the effect of amplitude changes. Each resulting frequency domain representation provides an indication of the spectral content of the audio in the particular block. The spectral content of successive blocks is compared and changes greater than a threshold may be taken to indicate the temporal start or temporal end of an auditory event.

Preferably, the frequency domain data is normalized, as is described below. The degree to which the frequency domain data needs to be normalized gives an indication of amplitude. Hence, if a change in this degree exceeds a predetermined threshold, that too may be taken to indicate an event boundary. Event start and end points resulting from spectral changes and from amplitude changes may be ORed together so that event boundaries resulting from either type of change are identified.
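The ORing of spectral and amplitude boundaries can be sketched as follows. This is a minimal illustration under stated assumptions: the per-block normalization degree is taken to be a level in dB, and the function names and the 6 dB amplitude threshold are hypothetical, not values given in this description.

```python
def amplitude_boundaries(normalization_db, threshold_db):
    """Flag block i when the normalization degree changes by more than the
    threshold relative to block i-1. The first block has no predecessor."""
    flags = [False]
    for prev, cur in zip(normalization_db, normalization_db[1:]):
        flags.append(abs(cur - prev) > threshold_db)
    return flags


def combine_boundaries(spectral_flags, amplitude_flags):
    """OR the two per-block flag lists so either type of change marks a boundary."""
    return [s or a for s, a in zip(spectral_flags, amplitude_flags)]


# Hypothetical per-block data for four blocks.
spectral = [False, True, False, False]
amplitude = amplitude_boundaries([-20.0, -19.0, -5.0, -6.0], threshold_db=6.0)
combined = combine_boundaries(spectral, amplitude)
# combined == [False, True, True, False]
```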

Although techniques described in said Crockett and Crockett et al. applications and papers are particularly useful in connection with aspects of the present invention, other techniques for identifying auditory events and event boundaries may be employed in aspects of the present invention.

DISCLOSURE OF THE INVENTION

According to one embodiment, a method for processing an audio signal is disclosed. The method includes monitoring a characteristic of the audio signal, identifying a change in the characteristic, establishing an auditory event boundary to identify the change in the characteristic, wherein an audio portion between consecutive auditory event boundaries constitutes an auditory event, and applying a modification to the audio signal based in part on an occurrence of an auditory event.

In some embodiments, the method operates on an audio signal that includes two or more channels of audio content. In these embodiments, the auditory event boundary is identified by examining changes in the characteristic between the two or more channels of the audio signal. In other embodiments, the audio processing method generates one or more dynamically-varying parameters in response to the auditory event.

Typically, an auditory event is a segment of audio that tends to be perceived as separate and distinct. One usable measure of signal characteristics includes a measure of the spectral content of the audio, for example, as described in the cited Crockett and Crockett et al. documents. All or some of the one or more audio dynamics processing parameters may be generated at least partly in response to the presence or absence and characteristics of one or more auditory events. An auditory event boundary may be identified as a change in signal characteristics with respect to time that exceeds a threshold. Alternatively, all or some of the one or more parameters may be generated at least partly in response to a continuing measure of the degree of change in signal characteristics associated with said auditory event boundaries. Although, in principle, aspects of the invention may be implemented in analog and/or digital domains, practical implementations are likely to be implemented in the digital domain in which each of the audio signals is represented by individual samples or samples within blocks of data. In this case, the signal characteristics may be the spectral content of audio within a block, the detection of changes in signal characteristics with respect to time may be the detection of changes in spectral content of audio from block to block, and auditory event temporal start and stop boundaries each coincide with a boundary of a block of data. It should be noted that, for the more traditional case of performing dynamic gain changes on a sample-by-sample basis, the auditory scene analysis described could be performed on a block basis and the resulting auditory event information used to perform dynamic gain changes that are applied sample-by-sample.

By controlling key audio dynamics processing parameters using the results of auditory scene analysis, a dramatic reduction of audible artifacts introduced by dynamics processing may be achieved.

The present invention presents two ways of performing auditory scene analysis. The first performs spectral analysis and identifies the location of perceptible audio events that are used to control the dynamic gain parameters by identifying changes in spectral content. The second way transforms the audio into a perceptual loudness domain (that may provide more psychoacoustically relevant information than the first way) and identifies the location of auditory events that are subsequently used to control the dynamic gain parameters. It should be noted that the second way requires that the audio processing be aware of absolute acoustic reproduction levels, which may not be possible in some implementations. Presenting both methods of auditory scene analysis allows implementations of ASA-controlled dynamic gain modification using processes or devices that may or may not be calibrated to take into account absolute reproduction levels.

In some embodiments, a method for processing an audio signal in an audio processing apparatus is disclosed. The method includes receiving the audio signal and a parameter, the parameter indicating a location of an auditory event boundary. An audio portion between consecutive auditory event boundaries constitutes an auditory event. The method further includes applying a modification to the audio signal based in part on an occurrence of the auditory event. The audio processing apparatus may be implemented at least in part in hardware and the parameter may be generated by monitoring a characteristic of the audio signal and identifying a change in the characteristic.

In some embodiments, a method for processing an audio signal in an audio processing apparatus is disclosed. The method includes receiving the audio signal. The audio signal may comprise at least one channel of audio content. The audio signal may be divided into a plurality of subband signals with an analysis filterbank. Each of the plurality of subband signals may include at least one subband sample. A characteristic of the audio signal may be derived. The characteristic is a power measure of the audio signal. The power measure may be smoothed to generate a smoothed power measure of the audio signal, wherein the smoothing is based on a low-pass filter. A location of an auditory event boundary may be detected by monitoring the smoothed power measure. An audio portion between consecutive auditory event boundaries may constitute an auditory event. A gain vector may be generated based on the location of the auditory event boundary. The gain vector may be applied to a version of the plurality of subband signals to generate modified subband signals. The modified subband signals may be synthesized with a synthesis filterbank to produce a modified audio signal. The audio processing apparatus can be implemented at least in part with hardware.

In some embodiments, the characteristic further includes loudness. The characteristic may include perceived loudness. The characteristic may further include phase. The characteristic may further include a sudden change in signal power. The audio signal may include two or more channels of audio content. The auditory event boundary may be identified by examining changes in the characteristic between the two or more channels. The characteristic may include interchannel phase difference. The characteristic may include interchannel correlation. The auditory event boundary may coincide with a beginning or end of a block of data in the audio signal. The auditory event boundary may be adjusted to coincide with a boundary of a block of data in the audio signal.

Aspects of the present invention are described herein in an audio dynamics processing environment that includes aspects of other inventions. Such other inventions are described in various pending United States and International Patent Applications of Dolby Laboratories Licensing Corporation, the owner of the present application, which applications are identified herein.

DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart showing an example of processing steps for performing auditory scene analysis.

FIG. 2 shows an example of block processing, windowing and performing the DFT on audio while performing the auditory scene analysis.

FIG. 3 is in the nature of a flow chart or functional block diagram, showing parallel processing in which audio is used to identify auditory events and to identify the characteristics of the auditory events such that the events and their characteristics are used to modify dynamics processing parameters.

FIG. 4 is in the nature of a flow chart or functional block diagram, showing processing in which audio is used only to identify auditory events and the event characteristics are determined from the audio event detection such that the events and their characteristics are used to modify the dynamics processing parameters.

FIG. 5 is in the nature of a flow chart or functional block diagram, showing processing in which audio is used only to identify auditory events and the event characteristics are determined from the audio event detection such that only the characteristics of the auditory events are used to modify the dynamics processing parameters.

FIG. 6 shows a set of idealized auditory filter characteristic responses that approximate critical banding on the ERB scale. The horizontal scale is frequency in Hertz and the vertical scale is level in decibels.

FIG. 7 shows the equal loudness contours of ISO 226. The horizontal scale is frequency in Hertz (logarithmic base 10 scale) and the vertical scale is sound pressure level in decibels.

FIGS. 8 a-c show idealized input/output characteristics and input gain characteristics of an audio dynamic range compressor.

FIGS. 9 a-f show an example of the use of auditory events to control the release time in a digital implementation of a traditional Dynamic Range Controller (DRC) in which the gain control is derived from the Root Mean Square (RMS) power of the signal.

FIGS. 10 a-f show an example of the use of auditory events to control the release time in a digital implementation of a traditional Dynamic Range Controller (DRC) in which the gain control is derived from the Root Mean Square (RMS) power of the signal for an alternate signal to that used in FIG. 9.

FIG. 11 depicts a suitable set of idealized AGC and DRC curves for the application of AGC followed by DRC in a loudness domain dynamics processing system. The goal of the combination is to make all processed audio have approximately the same perceived loudness while still maintaining at least some of the original audio's dynamics.

BEST MODE FOR CARRYING OUT THE INVENTION

Auditory Scene Analysis (Original, Non-Loudness Domain Method)

In accordance with an embodiment of one aspect of the present invention, auditory scene analysis may be composed of four general processing steps as shown in a portion of FIG. 1. The first step 1-1 (“Perform Spectral Analysis”) takes a time-domain audio signal, divides it into blocks and calculates a spectral profile or spectral content for each of the blocks. Spectral analysis transforms the audio signal into the short-term frequency domain. This may be performed using any filterbank, either based on transforms or banks of bandpass filters, and in either linear or warped frequency space (such as the Bark scale or critical band, which better approximate the characteristics of the human ear). With any filterbank there exists a tradeoff between time and frequency. Greater time resolution, and hence shorter time intervals, leads to lower frequency resolution. Greater frequency resolution, and hence narrower subbands, leads to longer time intervals.

The first step, illustrated conceptually in FIG. 1, calculates the spectral content of successive time segments of the audio signal. In a practical embodiment, the ASA block size may be any number of samples of the input audio signal, although 512 samples provide a good tradeoff of time and frequency resolution. In the second step 1-2, the differences in spectral content from block to block are determined (“Perform spectral profile difference measurements”). Thus, the second step calculates the difference in spectral content between successive time segments of the audio signal. As discussed above, a powerful indicator of the beginning or end of a perceived auditory event is believed to be a change in spectral content. In the third step 1-3 (“Identify location of auditory event boundaries”), when the spectral difference between one spectral-profile block and the next is greater than a threshold, the block boundary is taken to be an auditory event boundary. The audio segment between consecutive boundaries constitutes an auditory event. Thus, the third step sets an auditory event boundary between successive time segments when the difference in the spectral profile content between such successive time segments exceeds a threshold, thus defining auditory events. In this embodiment, auditory event boundaries define auditory events having a length that is an integral multiple of spectral profile blocks with a minimum length of one spectral profile block (512 samples in this example). In principle, event boundaries need not be so limited. As an alternative to the practical embodiments discussed herein, the input block size may vary, for example, so as to be essentially the size of an auditory event.

Following the identification of the event boundaries, key characteristics of the auditory event are identified, as shown in step 1-4.

Either overlapping or non-overlapping segments of the audio may be windowed and used to compute spectral profiles of the input audio. Overlap results in finer resolution as to the location of auditory events and, also, makes it less likely to miss an event, such as a short transient. However, overlap also increases computational complexity. Thus, overlap may be omitted. FIG. 2 shows a conceptual representation of non-overlapping N sample blocks being windowed and transformed into the frequency domain by the Discrete Fourier Transform (DFT). Each block may be windowed and transformed into the frequency domain, such as by using the DFT, preferably implemented as a Fast Fourier Transform (FFT) for speed.

The following variables may be used to compute the spectral profile of the input block:

-   M = number of windowed samples in a block used to compute spectral profile
-   P = number of samples of spectral computation overlap

In general, any integer numbers may be used for the variables above. However, the implementation will be more efficient if M is set equal to a power of 2 so that standard FFTs may be used for the spectral profile calculations. In a practical embodiment of the auditory scene analysis process, the parameters listed may be set to:

-   M = 512 samples (or 11.6 ms at 44.1 kHz)
-   P = 0 samples (no overlap)

The above-listed values were determined experimentally and were found generally to identify with sufficient accuracy the location and duration of auditory events. However, setting the value of P to 256 samples (50% overlap) rather than zero samples (no overlap) has been found to be useful in identifying some hard-to-find events. While many different types of windows may be used to minimize spectral artifacts due to windowing, the window used in the spectral profile calculations is an M-point Hanning, Kaiser-Bessel or other suitable, preferably non-rectangular, window. The above-indicated values and a Hanning window type were selected after extensive experimental analysis as they have been shown to provide excellent results across a wide range of audio material. Non-rectangular windowing is preferred for the processing of audio signals with predominantly low frequency content. Rectangular windowing produces spectral artifacts that may cause incorrect detection of events. Unlike certain encoder/decoder (codec) applications where an overall overlap/add process must provide a constant level, such a constraint does not apply here and the window may be chosen for characteristics such as its time/frequency resolution and stop-band rejection.
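The block size and overlap parameters can be illustrated with a short sketch. It assumes the audio is held as a Python list of samples and that trailing samples that do not fill a whole block are simply dropped; the function names are hypothetical.

```python
def block_duration_ms(m, sample_rate_hz):
    """Duration of an M-sample block in milliseconds."""
    return 1000.0 * m / sample_rate_hz


def split_into_blocks(samples, m=512, p=0):
    """Split samples into M-sample blocks that overlap by P samples.

    Successive blocks start M - P samples apart. Dropping the incomplete
    trailing block is an assumption of this sketch.
    """
    if not 0 <= p < m:
        raise ValueError("overlap P must satisfy 0 <= P < M")
    hop = m - p
    return [samples[i:i + m] for i in range(0, len(samples) - m + 1, hop)]


duration = block_duration_ms(512, 44100)        # about 11.6 ms
audio = [0.0] * 2048                            # placeholder signal
no_overlap = split_into_blocks(audio, 512, 0)   # 4 blocks
half_overlap = split_into_blocks(audio, 512, 256)  # 7 blocks
```

The 50% overlap case produces nearly twice as many blocks for the same audio, which is the added computational cost that the text weighs against finer event localization.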

In step 1-1 (FIG. 1), the spectrum of each M-sample block may be computed by windowing the data with an M-point Hanning, Kaiser-Bessel or other suitable window, converting to the frequency domain using an M-point Fast Fourier Transform, and calculating the magnitude of the complex FFT coefficients. The resultant data is normalized so that the largest magnitude is set to unity, and the normalized array of M numbers is converted to the log domain. The data may also be normalized by some other metric such as the mean magnitude value or mean power value of the data. The array need not be converted to the log domain, but the conversion simplifies the calculation of the difference measure in step 1-2. Furthermore, the log domain more closely matches the nature of the human auditory system. The resulting log domain values have a range of minus infinity to zero. In a practical embodiment, a lower limit may be imposed on the range of values; the limit may be fixed, for example −60 dB, or be frequency-dependent to reflect the lower audibility of quiet sounds at low and very high frequencies. (Note that it would be possible to reduce the size of the array to M/2 in that the FFT represents negative as well as positive frequencies.)
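Step 1-1 can be sketched as follows. This is a minimal illustration using only the standard library, assuming a Hann window, peak normalization, a conversion of 20·log10 to obtain dB, and a fixed −60 dB floor; the recursive FFT and all function names are stand-ins chosen for self-containment, and a practical implementation would use an optimized FFT routine.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


def log_spectral_profile(block, floor_db=-60.0):
    """Window, transform, normalize to a peak of unity, convert to dB, apply floor."""
    m = len(block)
    windowed = [s * 0.5 * (1.0 - math.cos(2.0 * math.pi * n / (m - 1)))
                for n, s in enumerate(block)]
    magnitudes = [abs(c) for c in fft(windowed)]
    peak = max(magnitudes)
    if peak == 0.0:
        return [floor_db] * m  # silent block: every bin sits at the floor
    profile = []
    for mag in magnitudes:
        ratio = mag / peak
        db = 20.0 * math.log10(ratio) if ratio > 0.0 else floor_db
        profile.append(max(db, floor_db))
    return profile


# Example: a sinusoid centred on bin 8 of a 64-sample block.
block = [math.sin(2.0 * math.pi * 8 * n / 64) for n in range(64)]
profile = log_spectral_profile(block)
```

Because of the peak normalization, the largest entry of the profile is 0 dB regardless of the input level, so the profile reflects spectral shape rather than amplitude.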

Step 1-2 calculates a measure of the difference between the spectra of adjacent blocks. For each block, each of the M (log) spectral coefficients from step 1-1 is subtracted from the corresponding coefficient for the preceding block, and the magnitude of the difference calculated (the sign is ignored). These M differences are then summed to one number. This difference measure may also be expressed as an average difference per spectral coefficient by dividing the difference measure by the number of spectral coefficients used in the sum (in this case M coefficients).
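The difference measure of step 1-2 is a sum of absolute differences between two log spectral profiles. A minimal sketch, with hypothetical function names and a four-coefficient toy example in dB:

```python
def spectral_difference(prev_profile, cur_profile):
    """Sum over coefficients of |current - previous| (the sign is ignored)."""
    return sum(abs(c - p) for p, c in zip(prev_profile, cur_profile))


def average_spectral_difference(prev_profile, cur_profile):
    """The same measure expressed per spectral coefficient."""
    return spectral_difference(prev_profile, cur_profile) / len(cur_profile)


prev = [0.0, -10.0, -20.0, -60.0]
cur = [-5.0, -10.0, 0.0, -60.0]
total = spectral_difference(prev, cur)            # 5 + 0 + 20 + 0 = 25.0
per_coeff = average_spectral_difference(prev, cur)  # 25 / 4 = 6.25
```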

Step 1-3 identifies the locations of auditory event boundaries by comparing each entry in the array of difference measures from step 1-2 with a threshold value. When a difference measure exceeds the threshold, the change in spectrum is deemed sufficient to signal a new event and the block number of the change is recorded as an event boundary. For the values of M and P given above and for log domain values (in step 1-1) expressed in units of dB, the threshold may be set equal to 2500 if the whole magnitude FFT (including the mirrored part) is compared or 1250 if half the FFT is compared (as noted above, the FFT represents negative as well as positive frequencies; for the magnitude of the FFT, one is the mirror image of the other). This value was chosen experimentally and it provides good auditory event boundary detection. This parameter value may be changed to reduce (increase the threshold) or increase (decrease the threshold) the detection of events.
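The thresholding of step 1-3 can be sketched as below. The indexing convention is an assumption of this sketch: entry i of the list holds the difference between block i and block i−1, with entry 0 set to zero because the first block has no predecessor. The difference values are invented for illustration.

```python
def find_event_boundaries(difference_measures, threshold=2500.0):
    """Return the block numbers whose difference measure exceeds the threshold."""
    return [i for i, d in enumerate(difference_measures) if d > threshold]


# Hypothetical difference measures for five blocks (whole-FFT comparison, dB units).
differences = [0.0, 300.0, 4100.0, 1200.0, 2600.0]
boundaries = find_event_boundaries(differences)               # [2, 4]
more_events = find_event_boundaries(differences, threshold=1000.0)  # [2, 3, 4]
```

Lowering the threshold in the second call detects an additional boundary at block 3, which matches the statement that decreasing the threshold increases the detection of events.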

The process of FIG. 1 may be represented more generally by the equivalent arrangements of FIGS. 3, 4 and 5. In FIG. 3, an audio signal is applied in parallel to an “Identify Auditory Events” function or step 3-1 that divides the audio signal into auditory events, each of which tends to be perceived as separate and distinct, and to an optional “Identify Characteristics of Auditory Events” function or step 3-2. The process of FIG. 1 may be employed to divide the audio signal into auditory events and identify their characteristics, or some other suitable process may be employed. The auditory event information, which may be an identification of auditory event boundaries, determined by function or step 3-1 is then used to modify the audio dynamics processing parameters (such as attack, release, ratio, etc.), as desired, by a “Modify Dynamics Parameters” function or step 3-3. The optional “Identify Characteristics” function or step 3-2 also receives the auditory event information. The “Identify Characteristics” function or step 3-2 may characterize some or all of the auditory events by one or more characteristics. Such characteristics may include an identification of the dominant subband of the auditory event, as described in connection with the process of FIG. 1. The characteristics may also include one or more audio characteristics, including, for example, a measure of power of the auditory event, a measure of amplitude of the auditory event, a measure of the spectral flatness of the auditory event, and whether the auditory event is substantially silent, or other characteristics that help modify dynamics parameters such that negative audible artifacts of the processing are reduced or removed. The characteristics may also include other characteristics such as whether the auditory event includes a transient.

Alternatives to the arrangement of FIG. 3 are shown in FIGS. 4 and 5. In FIG. 4, the audio input signal is not applied directly to the “Identify Characteristics” function or step 4-3; instead, that function receives information from the “Identify Auditory Events” function or step 4-1. The arrangement of FIG. 1 is a specific example of such an arrangement. In FIG. 5, the functions or steps 5-1, 5-2 and 5-3 are arranged in series.

The details of this practical embodiment are not critical. Other ways to calculate the spectral content of successive time segments of the audio signal, calculate the differences between successive time segments, and set auditory event boundaries at the respective boundaries between successive time segments when the difference in the spectral profile content between such successive time segments exceeds a threshold may be employed.

Auditory Scene Analysis (New, Loudness Domain Method)

International application under the Patent Cooperation Treaty S.N. PCT/US2005/038579, filed Oct. 25, 2005, published as International Publication Number WO 2006/047600 A1, entitled “Calculating and Adjusting the Perceived Loudness and/or the Perceived Spectral Balance of an Audio Signal” by Alan Jeffrey Seefeldt discloses, among other things, an objective measure of perceived loudness based on a psychoacoustic model. Said application is hereby incorporated by reference in its entirety. As described in said application, from an audio signal, x[n], an excitation signal E[b,t] is computed that approximates the distribution of energy along the basilar membrane of the inner ear at critical band b during time block t. This excitation may be computed from the Short-time Discrete Fourier Transform (STDFT) of the audio signal as follows:

$\begin{matrix}{{E\lbrack {b,t} \rbrack} = {{\lambda_{b}{E\lbrack {b,{t - 1}} \rbrack}} + {( {1 - \lambda_{b}} ){\sum\limits_{k}{{|{T\lbrack k\rbrack}|}^{2}{|{C_{b}\lbrack k\rbrack}|}^{2}{|{X\lbrack {k,t} \rbrack}|}^{2}}}}}} & (1)\end{matrix}$

where X[k,t] represents the STDFT of x[n] at time block t and bin k. Note that in equation 1, t represents time in discrete units of transform blocks as opposed to a continuous measure, such as seconds. T[k] represents the frequency response of a filter simulating the transmission of audio through the outer and middle ear, and C_(b)[k] represents the frequency response of the basilar membrane at a location corresponding to critical band b. FIG. 6 depicts a suitable set of critical band filter responses in which 40 bands are spaced uniformly along the Equivalent Rectangular Bandwidth (ERB) scale, as defined by Moore and Glasberg. Each filter shape is described by a rounded exponential function and the bands are distributed using a spacing of 1 ERB. Lastly, the smoothing time constant λ_(b) in equation 1 may be advantageously chosen proportionate to the integration time of human loudness perception within band b.
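One time step of equation 1 can be sketched as a one-pole smoother applied to a weighted sum of bin powers in each band. In this sketch the combined weights |T[k]|²|C_b[k]|² are assumed to be precomputed per band, and the function name and the toy numbers are hypothetical; real transmission and critical band filter responses are not reproduced here.

```python
def update_excitation(prev_excitation, band_weights, power_spectrum, smoothing):
    """Compute E[b,t] for all bands b from E[b,t-1], following equation 1.

    band_weights[b][k] stands for |T[k]|^2 * |C_b[k]|^2 (assumed precomputed),
    power_spectrum[k] stands for |X[k,t]|^2, and smoothing[b] is lambda_b.
    """
    excitation = []
    for b, weights in enumerate(band_weights):
        band_power = sum(w * p for w, p in zip(weights, power_spectrum))
        lam = smoothing[b]
        excitation.append(lam * prev_excitation[b] + (1.0 - lam) * band_power)
    return excitation


# Toy example with two bands and three bins (placeholder values).
e_prev = [0.0, 2.0]
weights = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
power = [4.0, 2.0, 6.0]
lambdas = [0.5, 0.75]
e_now = update_excitation(e_prev, weights, power, lambdas)  # [2.0, 2.5]
```

A value of λ_b closer to 1 gives heavier smoothing, so the excitation in that band responds more slowly to changes in the input power.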

Using equal loudness contours, such as those depicted in FIG. 7, the excitation at each band is transformed into an excitation level that would generate the same perceived loudness at 1 kHz. Specific loudness, a measure of perceptual loudness distributed across frequency and time, is then computed from the transformed excitation, E_(1kHz)[b, t], through a compressive non-linearity. One such suitable function to compute the specific loudness N[b,t] is given by:

$\begin{matrix}{N[b,t] = \beta\left( \left( \frac{E_{1kHz}[b,t]}{TQ_{1kHz}} \right)^{\alpha} - 1 \right)} & (2)\end{matrix}$
where TQ_(1kHz) is the threshold in quiet at 1 kHz and the constants β and α are chosen to match growth of loudness data as collected from listening experiments. Abstractly, this transformation from excitation to specific loudness may be represented by the function Ψ{ } such that:
N[b,t]=Ψ{E[b,t]}
Finally, the total loudness, L[t], represented in units of sone, is computed by summing the specific loudness across bands:

$\begin{matrix}{{L\lbrack t\rbrack} = {\sum\limits_{b}{N\lbrack {b,t} \rbrack}}} & (3)\end{matrix}$
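Equations 2 and 3 may be illustrated with the short sketch below. The function names and the numeric values of α, β and TQ_(1kHz) used in any call are placeholders chosen for illustration; the text only states that the constants are fitted to listening data.

```python
def specific_loudness(e_1khz, tq_1khz, alpha, beta):
    """Equation 2: compressive non-linearity applied to one band."""
    return beta * ((e_1khz / tq_1khz) ** alpha - 1.0)


def total_loudness(e_1khz_bands, tq_1khz, alpha, beta):
    """Equation 3: sum of specific loudness across bands b."""
    return sum(specific_loudness(e, tq_1khz, alpha, beta)
               for e in e_1khz_bands)
```

With this form, an excitation equal to the threshold in quiet maps to a specific loudness of zero.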

The specific loudness N[b,t] is a spectral representation meant to simulate the manner in which a human perceives audio as a function of frequency and time. It captures variations in sensitivity to different frequencies, variations in sensitivity to level, and variations in frequency resolution. As such, it is a spectral representation well matched to the detection of auditory events. Though more computationally complex, comparing the difference of N[b,t] across bands between successive time blocks may in many cases result in more perceptually accurate detection of auditory events in comparison to the direct use of successive FFT spectra described above.

In said patent application, several applications for modifying the audio based on this psychoacoustic loudness model are disclosed. Among these are several dynamics processing algorithms, such as AGC and DRC. These disclosed algorithms may benefit from the use of auditory events to control various associated parameters. Because specific loudness is already computed, it is readily available for the purpose of detecting said events. Details of a preferred embodiment are discussed below.

Audio Dynamics Processing Parameter Control with Auditory Events

Two examples of embodiments of the invention are now presented. The first describes the use of auditory events to control the release time in a digital implementation of a Dynamic Range Controller (DRC) in which the gain control is derived from the Root Mean Square (RMS) power of the signal. The second embodiment describes the use of auditory events to control certain aspects of a more sophisticated combination of AGC and DRC implemented within the context of the psychoacoustic loudness model described above. These two embodiments are meant to serve as examples of the invention only, and it should be understood that the use of auditory events to control parameters of a dynamics processing algorithm is not restricted to the specifics described below.

Dynamic Range Control

The described digital implementation of a DRC segments an audio signal x[n] into windowed, half-overlapping blocks, and for each block a modification gain based on a measure of the signal's local power and a selected compression curve is computed. The gain is smoothed across blocks and then multiplied with each block. The modified blocks are finally overlap-added to generate the modified audio signal y[n].

It should be noted that, while the auditory scene analysis and digital implementation of DRC as described here divide the time-domain audio signal into blocks to perform analysis and processing, the DRC processing need not be performed using block segmentation. For example, the auditory scene analysis could be performed using block segmentation and spectral analysis as described above, and the resulting auditory event locations and characteristics could be used to provide control information to a digital implementation of a traditional DRC that typically operates on a sample-by-sample basis. Here, however, the same blocking structure used for auditory scene analysis is employed for the DRC to simplify the description of their combination. Proceeding with the description of a block based DRC implementation, the overlapping blocks of the audio signal may be represented as:
$\begin{matrix}{x[n,t] = w[n]\, x[n + tM/2] \quad \text{for } 0 \leq n \leq M - 1} & (4)\end{matrix}$
where M is the block length and the hopsize is M/2, w[n] is the window, n is the sample index within the block, and t is the block index (note that here t is used in the same way as with the STDFT in equation 1; it represents time in discrete units of blocks rather than seconds, for example). Ideally, the window w[n] tapers to zero at both ends and sums to unity when half-overlapped with itself; the commonly used sine window meets these criteria, for example.
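The block segmentation of equation 4 may be sketched as follows. The particular sine window definition w[n] = sin(π(n + 0.5)/M) and the function names are assumptions made for illustration; the text does not fix the exact window formula.

```python
import math


def sine_window(m):
    """Assumed sine window of length m: w[n] = sin(pi * (n + 0.5) / m)."""
    return [math.sin(math.pi * (n + 0.5) / m) for n in range(m)]


def windowed_blocks(x, m):
    """Equation 4: x[n, t] = w[n] * x[n + t * m / 2] for 0 <= n <= m - 1.

    Only complete blocks are returned; trailing samples that do not fill
    a whole block are ignored in this sketch.
    """
    hop = m // 2
    w = sine_window(m)
    blocks = []
    t = 0
    while t * hop + m <= len(x):
        blocks.append([w[n] * x[n + t * hop] for n in range(m)])
        t += 1
    return blocks
```

With this particular window definition, it is the squared window that sums exactly to one under half-overlap, that is, w[n]² + w[n + M/2]² = 1.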

For each block, one may then compute the RMS power to generate a power measure P[t] in dB per block:

$\begin{matrix}{P[t] = 10\log_{10}\left( \frac{1}{M}\sum\limits_{n = 0}^{M - 1} x^{2}[n,t] \right)} & (5)\end{matrix}$
As mentioned earlier, one could smooth this power measure with a fast attack and slow release prior to processing with a compression curve, but as an alternative the instantaneous power P[t] is processed and the resulting gain is smoothed. This alternate approach has the advantage that a simple compression curve with sharp knee points may be used, but the resulting gains are still smooth as the power travels through the knee-point. Representing a compression curve as shown in FIG. 8 c as a function F of signal level that generates a gain, the block gain G[t] is given by:
$\begin{matrix}{G[t] = F\{ P[t] \}} & (6)\end{matrix}$
Assuming that the compression curve applies greater attenuation as signal level increases, the gain will be decreasing when the signal is in “attack mode” and increasing when in “release mode”. Therefore, a smoothed gain Ḡ[t] may be computed according to:
$\begin{matrix}{\overset{\_}{G}[t] = \alpha[t] \cdot \overset{\_}{G}[t - 1] + (1 - \alpha[t]) G[t]} & (7a)\end{matrix}$
where

$\begin{matrix}{\alpha[t] = \left\{ \begin{matrix}\alpha_{attack} & {G[t] < \overset{\_}{G}[t - 1]} \\\alpha_{release} & {G[t] \geq \overset{\_}{G}[t - 1]}\end{matrix} \right.} & (7b)\end{matrix}$
and
$\begin{matrix}{\alpha_{release} \gg \alpha_{attack}} & (7c)\end{matrix}$
Finally, the smoothed gain Ḡ[t], which is in dB, is applied to each block of the signal, and the modified blocks are overlap-added to produce the modified audio:
$\begin{matrix}{y[n + tM/2] = \left( 10^{\overset{\_}{G}[t]/20} \right) x[n,t] + \left( 10^{\overset{\_}{G}[t - 1]/20} \right) x[n + M/2,t - 1] \quad \text{for } 0 \leq n < M/2} & (8)\end{matrix}$
Note that because the blocks have been multiplied with a tapered window, as shown in equation 4, the overlap-add synthesis shown above effectively smooths the gains across samples of the processed signal y[n]. Thus, the gain control signal receives smoothing in addition to that shown in equation 7a. In a more traditional implementation of DRC operating sample-by-sample rather than block-by-block, gain smoothing more sophisticated than the simple one-pole filter shown in equation 7a might be necessary in order to prevent audible distortion in the processed signal. Also, the use of block based processing introduces an inherent delay of M/2 samples into the system, and as long as the decay time associated with α_(attack) is close to this delay, the signal x[n] does not need to be delayed further before the application of the gains for the purposes of preventing overshoot.
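The power measure of equation 5 and the attack/release smoothing of equations 7a and 7b may be sketched as below. The function names, the small floor inside the logarithm, and the initial smoothed gain of 0 dB are assumptions of this sketch rather than details given in the text.

```python
import math


def block_power_db(block):
    """Equation 5: mean-square power of one windowed block, in dB."""
    mean_square = sum(s * s for s in block) / len(block)
    # Assumed floor so that an all-zero block does not produce log10(0).
    return 10.0 * math.log10(max(mean_square, 1e-12))


def smooth_gains(gains_db, alpha_attack, alpha_release, initial_db=0.0):
    """Equations 7a-7b: one-pole smoothing with attack/release switching."""
    smoothed = []
    prev = initial_db  # assumed starting value of the smoothed gain
    for g in gains_db:
        # Falling gain means attack mode; rising or equal means release mode.
        alpha = alpha_attack if g < prev else alpha_release
        prev = alpha * prev + (1.0 - alpha) * g
        smoothed.append(prev)
    return smoothed
```

Because α_(release) is much larger than α_(attack), the smoothed gain drops quickly when attenuation is called for and recovers slowly afterwards.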

FIGS. 9 a through 9 c depict the result of applying the described DRC processing to an audio signal. For this particular implementation, a block length of M=512 is used at a sampling rate of 44.1 kHz. A compression curve similar to the one shown in FIG. 8 b is used: above −20 dB relative to full scale digital the signal is attenuated with a ratio of 5:1, and below −30 dB the signal is boosted with a ratio of 5:1. The gain is smoothed with an attack coefficient α_(attack) corresponding to a half-decay time of 10 ms and a release coefficient α_(release) corresponding to a half-decay time of 500 ms. The original audio signal depicted in FIG. 9 a consists of six consecutive piano chords, with the final chord, located around sample 1.75×10⁵, decaying into silence. Examining a plot of the gain G[t] in FIG. 9 b, it should be noted that the gain remains close to 0 dB while the six chords are played. This is because the signal energy remains, for the most part, between −30 dB and −20 dB, the region within which the DRC curve calls for no modification. However, after the hit of the last chord, the signal energy falls below −30 dB, and the gain begins to rise, eventually beyond 15 dB, as the chord decays. FIG. 9 c depicts the resulting modified audio signal, and one can see that the tail of the final chord is boosted significantly. Audibly, this boosting of the chord's natural, low-level decay sound creates an extremely unnatural result. It is the aim of the present invention to prevent problems of this type that are associated with a traditional dynamics processor.
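The parameters quoted above may be turned into numbers with the following sketch. Two assumptions are made that the text does not state: a one-pole coefficient is related to its half-decay time by α = 0.5^(hop/(t_half·fs)) with a hop of M/2 samples, and the compression curve is piecewise linear in dB with hard knees at −20 dB and −30 dB. The function names are hypothetical.

```python
def coeff_from_half_decay(half_decay_s, hop, fs):
    """Assumed mapping: the smoother halves its state after half_decay_s."""
    blocks_per_half_decay = half_decay_s * fs / hop
    return 0.5 ** (1.0 / blocks_per_half_decay)


def compression_gain_db(p_db, upper=-20.0, lower=-30.0, ratio=5.0):
    """Gain in dB for a block power p_db, using an assumed hard-knee curve."""
    slope = 1.0 / ratio - 1.0  # gain change per dB beyond a knee
    if p_db > upper:
        return (p_db - upper) * slope  # attenuation above the upper knee
    if p_db < lower:
        return (p_db - lower) * slope  # boost below the lower knee
    return 0.0  # null band: no modification


# Example values for M = 512 (hop of 256 samples) at 44.1 kHz.
alpha_attack = coeff_from_half_decay(0.010, 256, 44100)
alpha_release = coeff_from_half_decay(0.500, 256, 44100)
```

Under these assumptions the attack coefficient is roughly 0.67 and the release coefficient roughly 0.992, and a block at −50 dB calls for a boost of about 16 dB, in line with the gain rising beyond 15 dB as the final chord decays.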

FIGS. 10 a through 10 c depict the results of applying the exact same DRC system to a different audio signal. In this case the first half of the signal consists of an up-tempo music piece at a high level, and then at approximately sample 10×10⁴ the signal switches to a second up-tempo music piece, but at a significantly lower level. Examining the gain in FIG. 10 b, one sees that the signal is attenuated by approximately 10 dB during the first half, and then the gain rises back up to 0 dB during the second half when the softer piece is playing. In this case, the gain behaves as desired. One would like the second piece to be boosted relative to the first, and the gain should increase quickly after the transition to the second piece to be audibly unobtrusive. One sees a gain behavior that is similar to that for the first signal discussed, but here the behavior is desirable. Therefore, one would like to fix the first case without affecting the second. The use of auditory events to control the release time of this DRC system provides such a solution.

In the first signal that was examined in FIG. 9, the boosting of the last chord's decay seems unnatural because the chord and its decay are perceived as a single auditory event whose integrity is expected to be maintained. In the second case, however, many auditory events occur while the gain increases, meaning that for any individual event, little change is imparted. Therefore the overall gain change is not as objectionable. One may therefore argue that a gain change should be allowed only in the near temporal vicinity of an auditory event boundary. One could apply this principle to the gain while it is in either attack or release mode, but for most practical implementations of a DRC, the gain moves so quickly in attack mode in comparison to the human temporal resolution of event perception that no control is necessary. One may therefore use events to control smoothing of the DRC gain only when it is in release mode.

A suitable behavior of the release control is now described. In qualitative terms, if an event is detected, the gain is smoothed with the release time constant as specified above in Equation 7a. As time evolves past the detected event, and if no subsequent events are detected, the release time constant continually increases so that eventually the smoothed gain is “frozen” in place. If another event is detected, then the smoothing time constant is reset to the original value and the process repeats. In order to modulate the release time, one may first generate a control signal based on the detected event boundaries.

As discussed earlier, event boundaries may be detected by looking for changes in successive spectra of the audio signal. In this particular implementation, the DFT of each overlapping block x[n,t] may be computed to generate the STDFT of the audio signal x[n]:

$\begin{matrix}{X[k,t] = \sum\limits_{n = 0}^{M - 1} x[n,t]\, e^{- j\frac{2\pi kn}{M}}} & (9)\end{matrix}$
Next, the difference between the normalized log magnitude spectra of successive blocks may be computed according to:

$\begin{matrix}{D[t] = \sum\limits_{k} \left| X_{NORM}[k,t] - X_{NORM}[k,t - 1] \right|} & (10a)\end{matrix}$
where

$\begin{matrix}{X_{NORM}[k,t] = \log\left( \frac{|X[k,t]|}{\max\limits_{k}\{ |X[k,t]| \}} \right)} & (10b)\end{matrix}$
Here the maximum of |X[k,t]| across bins k is used for normalization, although one might employ other normalization factors; for example, the average of |X[k,t]| across bins. If the difference D[t] exceeds a threshold D_(min), then an event is considered to have occurred. Additionally, one may assign a strength to this event, lying between zero and one, based on the size of D[t] in comparison to a maximum threshold D_(max). The resulting auditory event strength signal A[t] may be computed as:

$\begin{matrix}{A[t] = \left\{ \begin{matrix}0 & {D[t] \leq D_{min}} \\\frac{D[t] - D_{min}}{D_{max} - D_{min}} & {D_{min} < D[t] < D_{max}} \\1 & {D[t] \geq D_{max}}\end{matrix} \right.} & (11)\end{matrix}$
By assigning a strength to the auditory event proportional to the amount of spectral change associated with that event, greater control over the dynamics processing is achieved in comparison to a binary event decision. The inventors have found that larger gain changes are acceptable during stronger events, and the signal in equation 11 allows such variable control.
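Equations 9 through 11 may be sketched in plain Python as follows. The direct O(M²) DFT, the natural logarithm, the small floor that guards against log(0), and all function names are assumptions of this sketch; a practical implementation would more likely use an FFT.

```python
import cmath
import math


def dft_magnitudes(block):
    """Equation 9, returning |X[k, t]| for each bin k (direct DFT)."""
    m = len(block)
    return [abs(sum(block[n] * cmath.exp(-2j * math.pi * k * n / m)
                    for n in range(m)))
            for k in range(m)]


def normalized_log_spectrum(mags, floor=1e-12):
    """Equation 10b: log magnitude normalized by the maximum across bins."""
    peak = max(max(mags), floor)
    return [math.log(max(v, floor) / peak) for v in mags]


def spectral_difference(prev_mags, cur_mags):
    """Equation 10a: summed absolute change between successive blocks."""
    a = normalized_log_spectrum(prev_mags)
    b = normalized_log_spectrum(cur_mags)
    return sum(abs(y - x) for x, y in zip(a, b))


def event_strength(d, d_min, d_max):
    """Equation 11: event strength between zero and one."""
    if d <= d_min:
        return 0.0
    if d >= d_max:
        return 1.0
    return (d - d_min) / (d_max - d_min)
```

Because of the normalization in equation 10b, a pure change in overall level between two blocks yields a difference of zero and so does not register as an event.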

The signal A[t] is an impulsive signal with an impulse occurring at the location of an event boundary. For the purposes of controlling the release time, one may further smooth the signal A[t] so that it decays smoothly to zero after the detection of an event boundary. The smoothed event control signal Ā[t] may be computed from A[t] according to:

$\begin{matrix}{\overset{\_}{A}[t] = \left\{ \begin{matrix}{A[t]} & {A[t] > \alpha_{event}\overset{\_}{A}[t - 1]} \\{\alpha_{event}\overset{\_}{A}[t - 1]} & \text{otherwise}\end{matrix} \right.} & (12)\end{matrix}$
Here α_(event) controls the decay time of the event control signal. FIGS. 9 d and 10 d depict the event control signal Ā[t] for the two corresponding audio signals, with the half-decay time of the smoother set to 250 ms. In the first case, one sees that an event boundary is detected for each of the six piano chords, and that the event control signal decays smoothly towards zero after each event. For the second signal, many events are detected very close to each other in time, and therefore the event control signal never decays fully to zero.
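The smoother of equation 12 may be sketched as a short loop. The function name and the initial state of zero are assumptions of this sketch.

```python
def smooth_event_control(strengths, alpha_event):
    """Equation 12: jump up on a new event, otherwise decay geometrically."""
    smoothed = []
    prev = 0.0  # assumed initial value of the smoothed control signal
    for a in strengths:
        decayed = alpha_event * prev
        prev = a if a > decayed else decayed
        smoothed.append(prev)
    return smoothed
```

For example, with α_(event) = 0.5 the strength sequence 1, 0, 0, 0.5 yields 1, 0.5, 0.25, 0.5: the control signal decays after the first event and is raised again by the weaker event in the last block.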

One may now use the event control signal Ā[t] to vary the release time constant used for smoothing the gain. When the control signal is equal to one, the smoothing coefficient α[t] from Equation 7a equals α_(release), as before, and when the control signal is equal to zero, the coefficient equals one so that the smoothed gain is prevented from changing. The smoothing coefficient is interpolated between these two extremes using the control signal according to:

$\begin{matrix}{\alpha[t] = \left\{ \begin{matrix}\alpha_{attack} & {G[t] < \overset{\_}{G}[t - 1]} \\{\overset{\_}{A}[t]\alpha_{release} + (1 - \overset{\_}{A}[t])} & {G[t] \geq \overset{\_}{G}[t - 1]}\end{matrix} \right.} & (13)\end{matrix}$
By interpolating the smoothing coefficient continuously as a function of the event control signal, the release time is reset to a value proportionate to the event strength at the onset of an event and then increases smoothly to infinity after the occurrence of an event. The rate of this increase is dictated by the coefficient α_(event) used to generate the smoothed event control signal.
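Combining equation 13 with the smoothing of equation 7a gives the event-controlled gain smoother sketched below. The function names and the initial smoothed gain of 0 dB are assumptions made for illustration.

```python
def event_controlled_alpha(gain, prev_smoothed, event_control,
                           alpha_attack, alpha_release):
    """Equation 13: attack is unchanged, release is interpolated toward 1."""
    if gain < prev_smoothed:
        return alpha_attack
    return event_control * alpha_release + (1.0 - event_control)


def smooth_gains_with_events(gains_db, event_control, alpha_attack,
                             alpha_release, initial_db=0.0):
    """Equation 7a driven by the coefficient of equation 13."""
    smoothed = []
    prev = initial_db  # assumed starting value of the smoothed gain
    for g, a_bar in zip(gains_db, event_control):
        alpha = event_controlled_alpha(g, prev, a_bar,
                                       alpha_attack, alpha_release)
        prev = alpha * prev + (1.0 - alpha) * g
        smoothed.append(prev)
    return smoothed
```

When the event control signal has decayed to zero, the release coefficient equals one and the smoothed gain stays where it is, which is the "frozen" behavior described above.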

FIGS. 9 e and 10 e show the effect of smoothing the gain with the event-controlled coefficient from Equation 13 as opposed to the non-event-controlled coefficient from Equation 7b. In the first case, the event control signal falls to zero after the last piano chord, thereby preventing the gain from moving upwards. As a result, the corresponding modified audio in FIG. 9 f does not suffer from an unnatural boost of the chord's decay. In the second case, the event control signal never approaches zero, and therefore the smoothed gain signal is inhibited very little through the application of the event control. The trajectory of the smoothed gain is nearly identical to the non-event-controlled gain in FIG. 10 b. This is exactly the desired effect.

Loudness Based AGC and DRC

As an alternative to traditional dynamics processing techniques where signal modifications are a direct function of simple signal measurements such as Peak or RMS power, International Patent Application S.N. PCT/US2005/038579 discloses use of the psychoacoustic based loudness model described earlier as a framework within which to perform dynamics processing. Several advantages are cited. First, measurements and modifications are specified in units of sone, which is a more accurate measure of loudness perception than more basic measures such as Peak or RMS power. Secondly, the audio may be modified such that the perceived spectral balance of the original audio is maintained as the overall loudness is changed. This way, changes to the overall loudness become less perceptually apparent in comparison to a dynamics processor that utilizes a wideband gain, for example, to modify the audio. Lastly, the psychoacoustic model is inherently multi-band, and therefore the system is easily configured to perform multi-band dynamics processing in order to alleviate the well-known cross-spectral pumping problems associated with a wideband dynamics processor.

Although performing dynamics processing in this loudness domain already holds several advantages over more traditional dynamics processing, the technique may be further improved through the use of auditory events to control various parameters. Consider the audio segment containing piano chords as depicted in FIG. 9 a and the associated DRC shown in FIGS. 9 b and c. One could perform a similar DRC in the loudness domain, and in this case, when the loudness of the final piano chord's decay is boosted, the boost would be less apparent because the spectral balance of the decaying note would be maintained as the boost is applied. However, a better solution is to not boost the decay at all, and therefore one may advantageously apply the same principle of controlling attack and release times with auditory events in the loudness domain as was previously described for the traditional DRC.

The loudness domain dynamics processing system that is now described consists of AGC followed by DRC. The goal of this combination is to make all processed audio have approximately the same perceived loudness while still maintaining at least some of the original audio's dynamics. FIG. 11 depicts a suitable set of AGC and DRC curves for this application. Note that the input and output of both curves are represented in units of sone since processing is performed in the loudness domain. The AGC curve strives to bring the output audio closer to some target level, and, as mentioned earlier, does so with relatively slow time constants. One may think of the AGC as making the long-term loudness of the audio equal to the target, but on a short-term basis, the loudness may fluctuate significantly around this target. Therefore, one may employ faster acting DRC to limit these fluctuations to some range deemed acceptable for the particular application. FIG. 11 shows such a DRC curve where the AGC target falls within the “null band” of the DRC, the portion of the curve that calls for no modification. With this combination of curves, the AGC places the long-term loudness of the audio within the null-band of the DRC curve so that minimal fast-acting DRC modifications need be applied. If the short-term loudness still fluctuates outside of the null-band, the DRC then acts to move the loudness of the audio towards this null-band. As a final general note, one may apply the slow acting AGC such that all bands of the loudness model receive the same amount of loudness modification, thereby maintaining the perceived spectral balance, and one may apply the fast acting DRC in a manner that allows the loudness modification to vary across bands in order to alleviate cross-spectral pumping that might otherwise result from fast acting band-independent loudness modification.

Auditory events may be utilized to control the attack and release of both the AGC and DRC. In the case of AGC, both the attack and release times are large in comparison to the temporal resolution of event perception, and therefore event control may be advantageously employed in both cases. With the DRC, the attack is relatively short, and therefore event control may be needed only for the release as with the traditional DRC described above.

As discussed earlier, one may use the specific loudness spectrum associated with the employed loudness model for the purposes of event detection. A difference signal D[t], similar to the one in Equations 10a and b, may be computed from the specific loudness N[b,t], defined in Equation 2, as follows:

$\begin{matrix}{D[t] = \sum\limits_{b} \left| N_{NORM}[b,t] - N_{NORM}[b,t - 1] \right|} & (14a)\end{matrix}$
where

$\begin{matrix}{N_{NORM}[b,t] = \frac{N[b,t]}{\max\limits_{b}\{ N[b,t] \}}} & (14b)\end{matrix}$
Here the maximum of |N[b,t]| across frequency bands b is used for normalization, although one might employ other normalization factors; for example, the average of |N[b,t]| across frequency bands. If the difference D[t] exceeds a threshold D_(min), then an event is considered to have occurred. The difference signal may then be processed in the same way shown in Equations 11 and 12 to generate a smooth event control signal Ā[t] used to control the attack and release times.
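The loudness-domain difference of equations 14a and 14b may be sketched as follows. The function name and the small floor that guards against division by zero on silent blocks are assumptions of this sketch.

```python
def loudness_difference(prev_n, cur_n, floor=1e-12):
    """Equations 14a-14b for two successive specific loudness spectra."""

    def normalize(n):
        # Equation 14b: divide by the maximum across bands b.
        peak = max(max(n), floor)
        return [v / peak for v in n]

    prev_norm = normalize(prev_n)
    cur_norm = normalize(cur_n)
    # Equation 14a: summed absolute change across bands.
    return sum(abs(c - p) for p, c in zip(prev_norm, cur_norm))
```

Unlike equation 10b, no logarithm is applied here, since specific loudness is already a compressed quantity.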

The AGC curve depicted in FIG. 11 may be represented as a function that takes as its input a measure of loudness and generates a desired output loudness:
$\begin{matrix}{L_{o} = F_{AGC}\{ L_{i} \}} & (15a)\end{matrix}$
The DRC curve may be similarly represented:
$\begin{matrix}{L_{o} = F_{DRC}\{ L_{i} \}} & (15b)\end{matrix}$
For the AGC, the input loudness is a measure of the audio's long-term loudness. One may compute such a measure by smoothing the instantaneous loudness L[t], defined in Equation 3, using relatively long time constants (on the order of several seconds). It has been shown that in judging an audio segment's long term loudness, humans weight the louder portions more heavily than the softer, and one may use a faster attack than release in the smoothing to simulate this effect. With the incorporation of event control for both the attack and release, the long-term loudness used for determining the AGC modification may therefore be computed according to:
$\begin{matrix}{L_{AGC}[t] = \alpha_{AGC}[t] L_{AGC}[t - 1] + (1 - \alpha_{AGC}[t]) L[t]} & (16a)\end{matrix}$
where

$\begin{matrix}{\alpha_{AGC}[t] = \left\{ \begin{matrix}{\overset{\_}{A}[t]\alpha_{AGCattack} + (1 - \overset{\_}{A}[t])} & {L[t] > L_{AGC}[t - 1]} \\{\overset{\_}{A}[t]\alpha_{AGCrelease} + (1 - \overset{\_}{A}[t])} & {L[t] \leq L_{AGC}[t - 1]}\end{matrix} \right.} & (16b)\end{matrix}$
In addition, one may compute an associated long-term specific loudness spectrum that will later be used for the multi-band DRC:
$\begin{matrix}{N_{AGC}[b,t] = \alpha_{AGC}[t] N_{AGC}[b,t - 1] + (1 - \alpha_{AGC}[t]) N[b,t]} & (16c)\end{matrix}$
In practice one may choose the smoothing coefficients such that the attack time is approximately half that of the release. Given the long-term loudness measure, one may then compute the loudness modification scaling associated with the AGC as the ratio of the output loudness to input loudness:

$\begin{matrix}{{S_{AGC}\lbrack t\rbrack} = \frac{F_{AGC}\{ {L_{AGC}\lbrack t\rbrack} \}}{L_{AGC}\lbrack t\rbrack}} & (17)\end{matrix}$
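Equations 16a, 16b and 17 may be sketched as below. The function names are hypothetical, and `agc_curve` stands for any callable implementing F_AGC; no particular curve shape is implied.

```python
def agc_alpha(loudness, prev_long_term, event_control,
              alpha_attack, alpha_release):
    """Equation 16b: event control applied to both attack and release."""
    base = alpha_attack if loudness > prev_long_term else alpha_release
    return event_control * base + (1.0 - event_control)


def update_long_term_loudness(loudness, prev_long_term, event_control,
                              alpha_attack, alpha_release):
    """Equation 16a: one block of long-term loudness smoothing."""
    a = agc_alpha(loudness, prev_long_term, event_control,
                  alpha_attack, alpha_release)
    return a * prev_long_term + (1.0 - a) * loudness


def agc_scaling(long_term, agc_curve):
    """Equation 17: ratio of output loudness to input loudness."""
    return agc_curve(long_term) / long_term
```

With the event control signal at zero, the long-term loudness does not move in either direction, so the AGC scaling is held between events.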

The DRC modification may now be computed from the loudness after the application of the AGC scaling. Rather than smooth a measure of the loudness prior to the application of the DRC curve, one may alternatively apply the DRC curve to the instantaneous loudness and then subsequently smooth the resulting modification. This is similar to the technique described earlier for smoothing the gain of the traditional DRC. In addition, the DRC may be applied in a multi-band fashion, meaning that the DRC modification is a function of the specific loudness N[b,t] in each band b, rather than the overall loudness L[t]. However, in order to maintain the average spectral balance of the original audio, one may apply DRC to each band such that the resulting modifications have the same average effect as would result from applying DRC to the overall loudness.

This may be achieved by scaling each band by the ratio of the long-term overall loudness (after the application of the AGC scaling) to the long-term specific loudness, and using this value as the argument to the DRC function. The result is then rescaled by the inverse of said ratio to produce the output specific loudness. Thus, the DRC scaling in each band may be computed according to:

$\begin{matrix}{S_{DRC}[b,t] = \frac{\frac{N_{AGC}[b,t]}{S_{AGC}[t] L_{AGC}[t]} F_{DRC}\left\{ \frac{S_{AGC}[t] L_{AGC}[t]}{N_{AGC}[b,t]} N[b,t] \right\}}{N[b,t]}} & (18)\end{matrix}$
The AGC and DRC modifications may then be combined to form a total loudness scaling per band:
$\begin{matrix}{S_{TOT}[b,t] = S_{AGC}[t] S_{DRC}[b,t]} & (19)\end{matrix}$
This total scaling may then be smoothed across time independently for each band with a fast attack and slow release and event control applied to the release only. Ideally smoothing is performed on the logarithm of the scaling analogous to the gains of the traditional DRC being smoothed in their decibel representation, though this is not essential. To ensure that the smoothed total scaling moves in sync with the specific loudness in each band, attack and release modes may be determined through the simultaneous smoothing of specific loudness itself:
$\begin{matrix}{\overset{\_}{S}_{TOT}[b,t] = \exp\left( \alpha_{TOT}[b,t] \log\left( \overset{\_}{S}_{TOT}[b,t - 1] \right) + (1 - \alpha_{TOT}[b,t]) \log\left( S_{TOT}[b,t] \right) \right)} & (20a)\end{matrix}$
$\begin{matrix}{\overset{\_}{N}[b,t] = \alpha_{TOT}[b,t] \overset{\_}{N}[b,t - 1] + (1 - \alpha_{TOT}[b,t]) N[b,t]} & (20b)\end{matrix}$
where

$\begin{matrix}{{\alpha_{TOT}\lbrack {b,t} \rbrack} = \{ \begin{matrix}\alpha_{TOTattack} & {{N\lbrack {b,t} \rbrack} > {\overset{\_}{N}\lbrack {b,{t - 1}} \rbrack}} \\{{{\overset{\_}{A}\lbrack t\rbrack}\alpha_{TOTrelease}} + ( {1 - {\overset{\_}{A}\lbrack t\rbrack}} )} & {{N\lbrack {b,t} \rbrack} \leq {\overset{\_}{N}\lbrack {b,{t - 1}} \rbrack}}\end{matrix} } & ( {20c} )\end{matrix}$
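The per-band DRC scaling of equation 18 and the smoothing of equations 20a through 20c may be sketched as follows for a single band. The function names are hypothetical, `drc_curve` stands for any callable implementing F_DRC, and all loudness and scaling values are assumed to be strictly positive so that the divisions and logarithms are defined.

```python
import math


def drc_scaling(n, n_agc, s_agc, l_agc, drc_curve):
    """Equation 18 for one band: scale up, apply the curve, scale back."""
    ratio = s_agc * l_agc / n_agc  # long-term overall / long-term specific
    return drc_curve(ratio * n) / ratio / n


def smooth_total_scaling(s_tot, prev_s_bar, n, prev_n_bar, event_control,
                         alpha_attack, alpha_release):
    """Equations 20a-20c for one band; returns (s_bar, n_bar)."""
    if n > prev_n_bar:
        alpha = alpha_attack  # attack mode: no event control
    else:
        # Release mode: coefficient interpolated toward 1 between events.
        alpha = event_control * alpha_release + (1.0 - event_control)
    # Equation 20a: smoothing performed on the logarithm of the scaling.
    s_bar = math.exp(alpha * math.log(prev_s_bar)
                     + (1.0 - alpha) * math.log(s_tot))
    # Equation 20b: specific loudness smoothed with the same coefficient.
    n_bar = alpha * prev_n_bar + (1.0 - alpha) * n
    return s_bar, n_bar
```

When `drc_curve` is the identity, the scaling of equation 18 reduces to one in every band, which corresponds to operation inside the null band of the DRC curve.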

Finally one may compute a target specific loudness based on the smoothed scaling applied to the original specific loudness
$\begin{matrix}{\hat{N}[b,t] = \overset{\_}{S}_{TOT}[b,t] N[b,t]} & (21)\end{matrix}$
and then solve for gains G[b,t] that when applied to the original excitation result in a specific loudness equal to the target:
$\begin{matrix}{\hat{N}[b,t] = \Psi\{ G^{2}[b,t] E[b,t] \}} & (22)\end{matrix}$
The gains may be applied to each band of the filterbank used to compute the excitation, and the modified audio may then be generated by inverting the filterbank to produce a modified time domain audio signal.
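The text does not specify how equation 22 is solved. One possible approach, sketched here purely as an assumption, is bisection on the gain for a single band, which is valid when Ψ is monotonically increasing and the solution lies between zero and a chosen upper bound. The function name `solve_gain` and its parameters are hypothetical.

```python
def solve_gain(target_n, excitation, psi, g_max=100.0, iterations=60):
    """Find G in [0, g_max] such that psi(G^2 * excitation) ~= target_n.

    psi is assumed to be monotonically increasing in its argument.
    """
    lo, hi = 0.0, g_max
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if psi(mid * mid * excitation) < target_n:
            lo = mid  # loudness too low: the gain must be larger
        else:
            hi = mid  # loudness high enough: the gain may be smaller
    return 0.5 * (lo + hi)
```

If Ψ takes the closed form of equation 2 applied directly to the excitation, the gain could instead be obtained analytically by inverting the power law.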

Additional Parameter Control

While the discussion above has focused on the control of AGC and DRC attack and release parameters via auditory scene analysis of the audio being processed, other important parameters may also benefit from being controlled via the ASA results. For example, the event control signal Ā[t] from Equation 12 may be used to vary the value of the DRC ratio parameter that is used to dynamically adjust the gain of the audio. The Ratio parameter, similarly to the attack and release time parameters, may contribute significantly to the perceptual artifacts introduced by dynamic gain adjustments.

Implementation

The invention may be implemented in hardware or software, or a combination of both (e.g., programmable logic arrays). Unless otherwise specified, the algorithms included as part of the invention are not inherently related to any particular computer or other apparatus. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct more specialized apparatus (e.g., integrated circuits) to perform the required method steps. Thus, the invention may be implemented in one or more computer programs executing on one or more programmable computer systems each comprising at least one processor, at least one data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device or port, and at least one output device or port. Program code is applied to input data to perform the functions described herein and generate output information. The output information is applied to one or more output devices, in known fashion.

Each such program may be implemented in any desired computer language (including machine, assembly, or high level procedural, logical, or object oriented programming languages) to communicate with a computer system. In any case, the language may be a compiled or interpreted language.

Each such computer program is preferably stored on or downloaded to a storage media or device (e.g., solid state memory or media, or magnetic or optical media) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer system to perform the procedures described herein. The inventive system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer system to operate in a specific and predefined manner to perform the functions described herein.

A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, some of the steps described herein may be order independent, and thus may be performed in an order different from that described.

It should be understood that implementation of other variations and modifications of the invention and its various aspects will be apparent to those skilled in the art, and that the invention is not limited by these specific embodiments described. It is therefore contemplated to cover by the present invention any and all modifications, variations, or equivalents that fall within the true spirit and scope of the basic underlying principles disclosed and claimed herein.

INCORPORATION BY REFERENCE

The following patents, patent applications and publications are hereby incorporated by reference, each in their entirety.

Audio Dynamics Processing

-   Audio Engineer's Reference Book, edited by Michael Talbot-Smith, 2^(nd) edition. Limiters and Compressors, Alan Tutton, pp. 2.149-2.165. Focal Press, Reed Educational and Professional Publishing, Ltd., 1999.

Detecting and Using Auditory Events

-   U.S. patent application Ser. No. 10/474,387, “High Quality Time-Scaling and Pitch-Scaling of Audio Signals” of Brett Graham Crockett, published Jun. 24, 2004 as US 2004/0122662 A1.
-   U.S. patent application Ser. No. 10/478,398, “Method for Time Aligning Audio Signals Using Characterizations Based on Auditory Events” of Brett G. Crockett et al, published Jul. 29, 2004 as US 2004/0148159 A1.
-   U.S. patent application Ser. No. 10/478,538, “Segmenting Audio Signals Into Auditory Events” of Brett G. Crockett, published Aug. 26, 2004 as US 2004/0165730 A1. Aspects of the present invention provide a way to detect auditory events in addition to those disclosed in said application of Crockett.
-   U.S. patent application Ser. No. 10/478,397, “Comparing Audio Using Characterizations Based on Auditory Events” of Brett G. Crockett et al, published Sep. 2, 2004 as US 2004/0172240 A1.
-   International Application under the Patent Cooperation Treaty S.N. PCT/US 05/24630 filed Jul. 13, 2005, entitled “Method for Combining Audio Signals Using Auditory Scene Analysis,” of Michael John Smithers, published Mar. 9, 2006 as WO 2006/026161.
-   International Application under the Patent Cooperation Treaty S.N. PCT/US 2004/016964, filed May 27, 2004, entitled “Method, Apparatus and Computer Program for Calculating and Adjusting the Perceived Loudness of an Audio Signal” of Alan Jeffrey Seefeldt et al, published Dec. 23, 2004 as WO 2004/111994 A2.
-   International application under the Patent Cooperation Treaty S.N. PCT/US2005/038579, filed Oct. 25, 2005, entitled “Calculating and Adjusting the Perceived Loudness and/or the Perceived Spectral Balance of an Audio Signal” by Alan Jeffrey Seefeldt and published as International Publication Number WO 2006/047600.
-   “A Method for Characterizing and Identifying Audio Based on Auditory Scene Analysis,” by Brett Crockett and Michael Smithers, Audio Engineering Society Convention Paper 6416, 118^(th) Convention, Barcelona, May 28-31, 2005.
-   “High Quality Multichannel Time Scaling and Pitch-Shifting using Auditory Scene Analysis,” by Brett Crockett, Audio Engineering Society Convention Paper 5948, New York, October 2003.
-   “A New Objective Measure of Perceived Loudness” by Alan Seefeldt et al, Audio Engineering Society Convention Paper 6236, San Francisco, Oct. 28, 2004.
-   Handbook for Sound Engineers, The New Audio Cyclopedia, edited by Glen M. Ballou, 2^(nd) edition. Dynamics, 850-851. Focal Press an imprint of Butterworth-Heinemann, 1998.
-   Audio Engineer's Reference Book, edited by Michael Talbot-Smith, 2^(nd) edition, Section 2.9 (“Limiters and Compressors” by Alan Tutton), pp. 2.149-2.165, Focal Press, Reed Educational and Professional Publishing, Ltd., 1999.

We claim:
1. A method for processing an audio signal in an audio processing apparatus, the method comprising: receiving the audio signal, the audio signal comprising at least two channels of audio content; dividing the audio signal into at least a subband signal, wherein the subband signal comprises at least one subband sample; deriving a power measure of the audio signal; smoothing the power measure to generate a smoothed power measure of the audio signal; detecting a location of an auditory event boundary by monitoring the smoothed power measure, wherein an audio portion between consecutive auditory event boundaries constitutes an auditory event, wherein the detecting further includes applying a threshold to the smoothed power measure to detect the location of the auditory event boundary; generating a gain vector based on the location of the auditory event boundary; and applying the gain vector to the audio signal; wherein the audio processing apparatus is implemented at least in part with hardware.
2. The method of claim 1, wherein the characteristic further includes loudness.
3. The method of claim 1, wherein the characteristic further includes perceived loudness.
4. The method of claim 1, wherein the characteristic further includes phase.
5. The method of claim 1, wherein the characteristic further includes a sudden change in signal power.
6. A non-transitory computer-readable storage medium encoded with a computer program for causing a computer to perform the method of claim 1.
7. An audio processing apparatus, the apparatus comprising: an input interface for receiving the audio signal, the audio signal comprising at least two channels of audio content; a filter bank for dividing the audio signal into a plurality of subband signals, each of the plurality of subband signals including at least one subband sample; and a processor that: derives a characteristic of the audio signal, wherein the characteristic is a power measure of the audio signal; smooths the power measure to generate a smoothed power measure of the audio signal; detects a location of an auditory event boundary by monitoring the smoothed power measure, wherein an audio portion between consecutive auditory event boundaries constitutes an auditory event, wherein the detecting further includes applying a threshold to the smoothed power measure to detect the location of the auditory event boundary; generates a gain vector based on the location of the auditory event boundary; and applies the gain vector to the audio signal; wherein the audio processing apparatus includes at least some hardware.
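
By way of illustration only, and without limiting the claims, the sequence of steps recited in claim 1 (power measure, smoothing, thresholding, gain vector, gain application) may be sketched in Python as follows. Everything specific in the sketch is an assumption made for illustration and is not taken from the claims: the function names, the block-based mean-square power, the one-pole smoother, the reading of "applying a threshold" as a threshold on the block-to-block change in smoothed power expressed in dB, and the gain rule that recomputes a gain only at an auditory event boundary and holds it within an event. The filter bank is omitted, so the full-band signal stands in for a single subband.

```python
import math

def block_power(channels, block_size):
    """Mean-square power per block, averaged over all channels (assumed measure)."""
    n_blocks = min(len(ch) for ch in channels) // block_size
    powers = []
    for b in range(n_blocks):
        start = b * block_size
        total = 0.0
        for ch in channels:
            total += sum(x * x for x in ch[start:start + block_size])
        powers.append(total / (block_size * len(channels)))
    return powers

def smooth(values, alpha):
    """One-pole smoother: y[t] = alpha * y[t-1] + (1 - alpha) * x[t]."""
    out = []
    state = values[0] if values else 0.0
    for v in values:
        state = alpha * state + (1.0 - alpha) * v
        out.append(state)
    return out

def detect_boundaries(smoothed, threshold_db):
    """Mark block t as an event boundary when the smoothed power changes by
    more than threshold_db relative to block t-1 (assumed threshold rule)."""
    eps = 1e-12  # avoids log of zero on digital silence
    boundaries = []
    for t in range(1, len(smoothed)):
        ratio = (smoothed[t] + eps) / (smoothed[t - 1] + eps)
        if abs(10.0 * math.log10(ratio)) > threshold_db:
            boundaries.append(t)
    return boundaries

def gain_vector(smoothed, boundaries, target_power, max_gain=10.0):
    """One gain per block. The gain is recomputed only at the first block and
    at event boundaries, and is held constant within an event (assumed rule)."""
    boundary_set = set(boundaries)
    gains = []
    gain = 1.0
    for t, p in enumerate(smoothed):
        if t == 0 or t in boundary_set:
            gain = min(math.sqrt(target_power / max(p, 1e-12)), max_gain)
        gains.append(gain)
    return gains

def apply_gains(channels, gains, block_size):
    """Scale every sample of every channel by the gain of its block."""
    n = len(gains) * block_size
    return [[ch[i] * gains[i // block_size] for i in range(n)] for ch in channels]

# Hypothetical two-channel input: a quiet passage followed by a loud passage.
BLOCK = 8
quiet = [0.01 * (1 if i % 2 == 0 else -1) for i in range(4 * BLOCK)]
loud = [1.0 * (1 if i % 2 == 0 else -1) for i in range(4 * BLOCK)]
stereo = [quiet + loud, quiet + loud]

smoothed_power = smooth(block_power(stereo, BLOCK), alpha=0.5)
event_boundaries = detect_boundaries(smoothed_power, threshold_db=3.0)
gains = gain_vector(smoothed_power, event_boundaries, target_power=0.25)
processed = apply_gains(stereo, gains, BLOCK)
print(event_boundaries, gains)
```

Holding the gain between boundaries is one simple way to tie gain changes to auditory events rather than to every fluctuation in level; a practical system could instead vary the smoothing of the gain as a function of the boundary locations.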